
Conversation

@Si1w (Contributor) commented Mar 18, 2025

This PR adds HF->GGUF conversion and inference support for the PLM model PLM-1.8B-Instruct.

The model has already been converted to GGUF, quantized, and tested: PLM-1.8B-Instruct-gguf, PLM-1.8B-Instruct-id-gguf

The model architecture is similar to DeepSeek V2 and MiniCPM3. The key points of the model are:

  • Sparse FFN: PLM uses a squared-ReLU activation with plain up and down projections (see the sketch after this list)
  • MLA: PLM uses Multi-head Latent Attention
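For the sparse FFN, here is a minimal PyTorch sketch of the squared-ReLU block. The module and layer names are illustrative, not PLM's actual class names; the 2048/8192 sizes are taken from plm.embedding_length and plm.feed_forward_length in the GGUF metadata logged below:

```python
import torch
import torch.nn as nn

class SquaredReLUFFN(nn.Module):
    # Illustrative sketch, not the real PLM module names.
    # Sizes match plm.embedding_length / plm.feed_forward_length.
    def __init__(self, hidden_size: int = 2048, intermediate_size: int = 8192):
        super().__init__()
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # relu(x)**2: negative pre-activations become exactly zero,
        # which is what makes the FFN activations sparse.
        return self.down_proj(torch.relu(self.up_proj(x)) ** 2)
```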

The details of the model can be found in the following paper:

PLM: Efficient Peripheral Language Models Hardware-Co-Designed for Ubiquitous Computing
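For readers new to MLA, a deliberately simplified PyTorch sketch of the core idea: K/V are re-expanded from a small cached latent instead of being cached per head. The rank of 512 matches plm.attention.kv_lora_rank in the metadata below; everything else (names, single head_dim, omission of the decoupled RoPE path and the causal mask) is a simplification for illustration, not PLM's actual implementation:

```python
import torch
import torch.nn as nn

class SimplifiedMLA(nn.Module):
    # Toy illustration: in real MLA only the small latent (rank 512 here)
    # needs to be cached; K and V are re-expanded from it per head.
    # Decoupled RoPE and the causal mask are omitted for brevity.
    def __init__(self, d_model=2048, n_heads=16, kv_lora_rank=512, head_dim=128):
        super().__init__()
        self.h, self.d = n_heads, head_dim
        self.q_proj = nn.Linear(d_model, n_heads * head_dim, bias=False)
        self.kv_down = nn.Linear(d_model, kv_lora_rank, bias=False)  # output is what gets cached
        self.kv_up = nn.Linear(kv_lora_rank, n_heads * 2 * head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * head_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.h, self.d).transpose(1, 2)
        kv = self.kv_up(self.kv_down(x)).view(B, T, self.h, 2 * self.d).transpose(1, 2)
        k, v = kv.chunk(2, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d ** 0.5, dim=-1)
        return self.o_proj((attn @ v).transpose(1, 2).reshape(B, T, self.h * self.d))
```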


Self-reported review complexity:

  • Low
  • Medium
  • High

@github-actions bot added the python (python script changes) label Mar 18, 2025
@arch-btw (Contributor) commented

Tested both the premade GGUF and converting to GGUF myself; both work 👍

Looks like it's using the qwen2 tokenizer with the associated ChatML prompt template:

llama_model_loader: - kv   0:                       general.architecture str              = plm
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = PLM 1.8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = PLM
llama_model_loader: - kv   5:                         general.size_label str              = 1.8B
llama_model_loader: - kv   6:                            general.license str              = mit
llama_model_loader: - kv   7:                            plm.block_count u32              = 32
llama_model_loader: - kv   8:                         plm.context_length u32              = 4096
llama_model_loader: - kv   9:                       plm.embedding_length u32              = 2048
llama_model_loader: - kv  10:                    plm.feed_forward_length u32              = 8192
llama_model_loader: - kv  11:                   plm.attention.head_count u32              = 16
llama_model_loader: - kv  12:                plm.attention.head_count_kv u32              = 16
llama_model_loader: - kv  13:                         plm.rope.freq_base f32              = 100000.000000
llama_model_loader: - kv  14:       plm.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:                             plm.vocab_size u32              = 151936
llama_model_loader: - kv  16:                 plm.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  17:                   plm.attention.key_length u32              = 192
llama_model_loader: - kv  18:                 plm.attention.value_length u32              = 128
llama_model_loader: - kv  19:                   plm.rope.dimension_count u32              = 64
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  26:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - kv  30:                          general.file_type u32              = 15
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type q4_K:  176 tensors
llama_model_loader: - type q6_K:   17 tensors
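If you want to double-check this metadata programmatically, a quick sketch using llama.cpp's gguf-py package (the filename here is illustrative):

```python
# Sketch assuming llama.cpp's gguf-py package is installed (pip install gguf).
from gguf import GGUFReader

reader = GGUFReader("PLM-1.8B-Instruct-Q4_K_M.gguf")  # illustrative filename
for name in reader.fields:  # same kv keys as in the loader log above
    print(name)
print(len(reader.tensors), "tensors")
```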

The only small error was a brief switch in language, but that's probably not related to this PR:

> hello!
通往成功 (Chinese for "the road to success")

> How are you?
I'm doing well, thank you! How about you? How can I help you today?

convert_hf_to_gguf.py output:

python convert_hf_to_gguf.py /home/test/PLM-1.8B-Instruct --outtype f32
.....
INFO:hf-to-gguf:Set meta model
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 4096
INFO:hf-to-gguf:gguf: embedding length = 2048
INFO:hf-to-gguf:gguf: feed forward length = 8192
INFO:hf-to-gguf:gguf: head count = 16
INFO:hf-to-gguf:gguf: key-value head count = 16
INFO:hf-to-gguf:gguf: rope theta = 100000.0
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-06
INFO:hf-to-gguf:gguf: file type = 0
INFO:hf-to-gguf:Set model tokenizer
INFO:gguf.vocab:Adding 151387 merge(s).
INFO:gguf.vocab:Setting special token type eos to 151643
INFO:gguf.vocab:Setting special token type pad to 151643
INFO:gguf.vocab:Setting special token type bos to 151643
INFO:gguf.vocab:Setting chat_template to {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
You are a helpful assistant<|im_end|>
' }}{% endif %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}
INFO:hf-to-gguf:Set model quantization version
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:/home/test/PLM-1.8B-Instruct/PLM-1.8B-Instruct-F32.gguf: n_tensors = 290, total_size = 7.3G
Writing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.30G/7.30G [00:11<00:00, 643Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to /home/test/PLM-1.8B-Instruct/PLM-1.8B-Instruct-F32.gguf
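As a quick sanity check of the stored chat template, a jinja2 sketch (not part of the conversion script) that renders it for a single user message and shows the expected ChatML layout:

```python
from jinja2 import Template

# Template copied verbatim from the tokenizer.chat_template value logged above.
CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "{% if loop.first and messages[0]['role'] != 'system' %}"
    "{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}"
    "{% endif %}"
    "{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
)

messages = [{"role": "user", "content": "hello!"}]
print(Template(CHAT_TEMPLATE).render(messages=messages, add_generation_prompt=True))
# <|im_start|>system
# You are a helpful assistant<|im_end|>
# <|im_start|>user
# hello!<|im_end|>
# <|im_start|>assistant
```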

@ngxson requested a review from ggerganov March 21, 2025 20:13
@Si1w (Contributor, Author) commented Mar 27, 2025

@ggerganov @slaren @ngxson I have fixed the problem and tested the models. Could you review again and merge? Thanks in advance.

@Si1w requested review from ggerganov, ngxson and slaren March 27, 2025 09:52
@ggerganov (Member) commented

@Si1w What is the difference between the "instruct" and "instruct-id" models?

@Si1w (Contributor, Author) commented Mar 27, 2025

> @Si1w What is the difference between the "instruct" and "instruct-id" models?

Basically, there is no significant difference; "instruct-id" is the model with identity information, i.e. it knows that its name is PLM.

@ggerganov (Member) commented

Let's merge if CI is green.

@ggerganov merged commit f125b8d into ggml-org:master Mar 27, 2025
50 checks passed